Structured Indexing Model for Cross-Language Information Retrieval
Authors
Abstract
In recent digital library systems and on the World Wide Web, parallel corpora are used by many applications (natural language processing, machine translation, terminology extraction, etc.). This paper presents a new cross-language information retrieval model based on language modeling. The model avoids query and/or document translation and the use of external resources. It proposes a structured indexing schema for multilingual documents that combines a keywords model and a keyphrases model. Applied to parallel collections, a query in one language can retrieve documents in the same language as well as documents in other languages. Promising results are reported on the MuchMore parallel collection (German and English).
Similar resources
Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis
The domain-specific track aims at mono- and cross-language information retrieval on structured scientific data. This track studies retrieval in a domain-specific context using two social science databases: The German Indexing and Retrieval Testdatabase (GIRT) (fourth version GIRT-4: German/English pseudo-parallel corpus with identical documents) with 302,638 documents in total, and the Russian Soc...
Translation-Based Indexing for Cross-Language Retrieval
Structured queries have proven to be an effective technique for cross-language information retrieval when evidence about translation probability is not available. Query execution time is adversely impacted, however, because the full postings list for each translation is used in the computation. This paper describes an alternative approach, translation-based indexing, that improves query-time eff...
Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Approach
Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an i...
Garnata: An Information Retrieval System for Structured Documents based on Probabilistic Graphical Models
In this paper, Garnata, an information retrieval system for XML documents is presented. This system is specifically designed for implementing Bayesian network-based models for structured documents. We show its architecture and performance from the indexing and the retrieval points of view, coming to the conclusion that the system is flexible and fast.
Indexing a web site with a terminology oriented ontology
This article presents a new approach in order to index a Web site. It uses ontologies and natural language techniques for information retrieval on the Internet. The main goal is to build a structured index of the Web site. This structure is given by a terminology oriented ontology of a domain which is chosen a priori according to the content of the Web site. First, the indexing process uses imp...